import pandas as pd
from matplotlib import pyplot as plt
from datetime import timedelta
import seaborn as sns
plt.style.use('seaborn')
%matplotlib inline
The first dataset is an export of my ride data from Strava, an online social
network site for cycling and other sports. This data is a log of every ride since the start of 2018
and contains summary data like the distance and average speed. It was exported using
the script stravaget.py which uses the stravalib module to read data. Some details of
the fields exported by that script can be seen in the documentation for stravalib.
The exported data is a CSV file, so it is easy to read; however, the dates in the file are recorded in a different timezone (UTC), so we need to do a bit of conversion. In reading the data I set the index of the data frame to the datetime of the ride.
strava = pd.read_csv('data/strava_export.csv', index_col='date', parse_dates=True)
strava.index = strava.index.tz_localize('UTC')
strava.head()
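As a minimal sketch of the timezone handling above (toy timestamps, not the actual export): tz_localize stamps naive timestamps with a zone, and tz_convert shifts them to another zone if local times are wanted.

```python
import pandas as pd

# Toy datetime index standing in for the ride dates
idx = pd.to_datetime(['2018-01-01 00:00', '2018-06-01 00:00'])
s = pd.Series([1, 2], index=idx)

# Stamp the naive timestamps as UTC, as done for the Strava export above
s.index = s.index.tz_localize('UTC')

# tz_convert shifts the clock to local (Sydney) time
local = s.tz_convert('Australia/Sydney')
print(local.index[0])  # 2018-01-01 11:00:00+11:00 (Sydney is UTC+11 in January)
```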
The second dataset comes from an application called GoldenCheetah which provides some analytics services over ride data. This has some of the same fields but adds a lot of analysis of the power, speed and heart rate data in each ride. This data overlaps with the Strava data but doesn't include all of the same rides.
Again we create an index using the datetime for each ride, this time combining two columns in the data (date and time) and localising to Sydney so that the times match those for the Strava data.
cheetah = pd.read_csv('data/cheetah.csv', skipinitialspace=True)
cheetah.index = pd.to_datetime(cheetah['date'] + ' ' + cheetah['time'])
cheetah.index = cheetah.index.tz_localize('Australia/Sydney')
cheetah.head()
The GoldenCheetah data contains many variables (columns) and I won't go into all of them here. Some that are of particular interest for the analysis below are:
Here are definitions of some of the more important fields in the data. Capitalised fields come from the GoldenCheetah data while lowercase_fields come from Strava. Many fields are duplicated; in those cases the values should be the same, although there is room for variation because each tool may use a different algorithm to calculate them.
Some of the GoldenCheetah parameters are defined in their documentation.
Your first task is to combine these two data frames using the join method of Pandas. The goal is to keep only those rows of data
that appear in both data frames so that we have complete data for every row.
# Keep only the rows that appear in both data frames (inner join)
result = pd.concat([strava, cheetah], axis=1, join='inner')
result.head()
# Show the dimensions of the new combined data frame
result.shape
There are three ride categories: Race, Workout and Ride. What leads to more kudos? Is there anything to indicate which rides are more popular? Explore the relationship between the main variables and kudos. Show a plot and comment on any relationship you observe.
Generate a plot that summarises the number of km ridden each month over the period of the data. Overlay this with the sum of the Training Stress Score and the average of the Average Speed to generate an overall summary of activity.
Generate a similar graph but one that shows the activity over a given month, with the sum of the values for each day of the month shown. So, if there are two rides on a given day, the graph should show the sum of the distances etc for these rides.
Hint: to generate these summary plots you need to use the timeseries/date functionality in Pandas to generate a new data frame containing the required data.
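As a sketch of that hint (made-up numbers, with column names mirroring the ride data), resample aggregates a datetime-indexed frame by calendar month:

```python
import pandas as pd

# Toy rides with made-up values; column names mirror the ride data
idx = pd.to_datetime(['2018-01-05', '2018-01-20', '2018-02-10'])
rides = pd.DataFrame({'distance': [30.0, 50.0, 40.0],
                      'TSS': [80, 120, 100]}, index=idx)

# Resample by calendar month and sum each column
monthly = rides.resample('M').sum()
print(monthly)  # January sums to distance 80.0 / TSS 200; February to 40.0 / 100
```

The same pattern with `.mean()` instead of `.sum()` gives the monthly average of a column such as Average Speed.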
Remove rides with no measured power (where device_watts is False): these are commutes or MTB rides. Also remove rows that have missing data (NaN values).
# Remove rides with no measured power (where device_watts is False)
result_true = result[result['device_watts']==True]
# Remove rides with missing data
result_true = result_true.dropna()
Look at the distributions of some key variables: time, distance, average speed, average power, TSS. Are they normally distributed? Skewed?
# Plot the histogram and boxplot for time, distance, average speed, average power and TSS
f, axes = plt.subplots(2, 5,figsize=(20, 6))
sns.distplot(result_true['moving_time'], ax=axes[0,0])
sns.distplot(result_true['distance'], ax=axes[0,1])
sns.distplot(result_true['Average Speed'], ax=axes[0,2])
sns.distplot(result_true['Average Power'], ax=axes[0,3])
sns.distplot(result_true['TSS'], ax=axes[0,4])
sns.boxplot(result_true['moving_time'], ax=axes[1,0])
sns.boxplot(result_true['distance'], ax=axes[1,1])
sns.boxplot(result_true['Average Speed'], ax=axes[1,2])
sns.boxplot(result_true['Average Power'], ax=axes[1,3])
sns.boxplot(result_true['TSS'], ax=axes[1,4])
From the histograms and boxplots above, the moving_time, distance, Average Power, and TSS variables are not normally distributed but skewed to the right: the box in each boxplot sits to the left of centre rather than in the middle. Average Speed, by contrast, looks close to normally distributed.
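The visual impression can also be checked numerically with pandas' sample skewness; a sketch on toy values (calling skew() on e.g. result_true['moving_time'] would give the real figure):

```python
import pandas as pd

# Toy series with a long right tail, standing in for a variable like moving_time
right_tailed = pd.Series([1, 1, 2, 2, 3, 3, 4, 10, 20])

# Positive sample skewness confirms a right (positive) skew
print(right_tailed.skew())
```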
Explore the relationships between the following variables. Are any of them correlated with each other (do they vary together in a predictable way)? Can you explain any relationships you observe?
# Extracting the columns that are going to be analysed
result_true['elevation_gain_number'] = result_true['elevation_gain'].str.extract(r'(\d+)', expand=False).astype(int)
analyze = result_true[['Distance', 'moving_time', 'Average Speed', 'average_heartrate', 'average_watts', 'NP', 'TSS', 'elevation_gain_number']]
# Define the correlation coefficient function
from scipy.stats import pearsonr
def corrfunc(x, y, **kws):
    (r, p) = pearsonr(x, y)
    ax = plt.gca()
    ax.text(0.1, 0.1, "{0:.2f}".format(r),
            transform=ax.transAxes,
            color='black', fontsize=60)
# Plotting the scatter plot and the correlation coefficient
g = sns.PairGrid(analyze)
g.map_lower(sns.regplot, color="0",line_kws={"color":"r","alpha":0.7,"lw":2}, scatter_kws={'s':5})
g.map_upper(corrfunc)
The scatter plots picture how each pair of variables interacts. For example, in the first scatter plot, between moving_time and Distance, we can see that as Distance increases, moving_time also increases, so they have a positive correlation: they move together.
The diagrams above also show the correlation coefficients, so we can easily judge the relationship between two variables from the number. A coefficient of 1 is total positive correlation, -1 is total negative correlation, and 0 represents no correlation.
From the information and diagrams above, these are the variables with high correlation (coefficient above 0.50 or below -0.50):
We want to explore the differences between the three categories: Race, Workout and Ride.
# Make a new dataset for the key variables to be observed later
differences = result_true[['workout_type', 'elapsed_time', 'Distance', 'average_temp', 'moving_time', 'Average Speed', 'average_heartrate', 'average_watts', 'NP', 'TSS', 'elevation_gain_number', 'Calories (HR)']]
# A quick look at the key variables using a PairGrid
g = sns.PairGrid(differences, hue="workout_type")
g = g.map_diag(plt.hist, histtype='step', lw=3)
g = g.map_offdiag(plt.scatter, s=15, alpha=0.5)
g = g.add_legend()
#Saving figure to a file so it can be clearly seen in the pdf file
g.savefig('plot/Portofolio 1 plot.pdf')
We don't have the same number of rides in each category, so the analysis may be biased; in fact there are only 5 rides in the Workout category. With that caveat, here is the analysis from the histograms and scatter plots above:
# Correlation between NP and TSS
gr = sns.lmplot(x="NP", y="TSS", hue="workout_type", data=result_true)
# Correlation between average_heartrate and average_watts
gr = sns.lmplot(x="average_heartrate", y="average_watts", hue="workout_type", data=result_true)
# Correlation between Calories and average_watts
gr = sns.lmplot(x="Calories (HR)", y="average_watts", hue="workout_type", data=result_true)
sns.boxplot(data=result_true, x='kudos', y='workout_type')
race = result_true.loc[result_true['workout_type'] == 'Race']
workout = result_true.loc[result_true['workout_type'] == 'Workout']
ride = result_true.loc[result_true['workout_type'] == 'Ride']
print ("Ride mean", ride["kudos"].mean())
print ("Race mean", race["kudos"].mean())
print ("Workout mean", workout["kudos"].mean())
From the box plot and the means above, it can be concluded that the Race category gains the most kudos, followed by Ride, with Workout the least.
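The three per-category means above can be computed in one step with groupby; a sketch on toy rows (hypothetical kudos values) with the same column names:

```python
import pandas as pd

# Toy rides with hypothetical kudos values, same column names as the ride frame
toy = pd.DataFrame({
    'workout_type': ['Race', 'Race', 'Ride', 'Workout'],
    'kudos':        [12,     10,     6,      2],
})

# Mean kudos per category in one step, highest first
means = toy.groupby('workout_type')['kudos'].mean().sort_values(ascending=False)
print(means)  # Race 11.0, Ride 6.0, Workout 2.0
```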
sns.lmplot(x="kudos", y="elapsed_time", data=result_true)
sns.lmplot(x="kudos", y="Distance", data=result_true)
sns.lmplot(x="kudos", y="average_temp", data=result_true)
sns.lmplot(x="kudos", y="moving_time", data=result_true)
sns.lmplot(x="kudos", y="Average Speed", data=result_true)
sns.lmplot(x="kudos", y="average_heartrate", data=result_true)
sns.lmplot(x="kudos", y="average_watts", data=result_true)
sns.lmplot(x="kudos", y="NP", data=result_true)
sns.lmplot(x="kudos", y="TSS", data=result_true)
sns.lmplot(x="kudos", y="elevation_gain_number", data=result_true)
sns.lmplot(x="kudos", y="Calories (HR)", data=result_true)
From the scatter plots above, it can be seen that kudos has a positive relationship with these variables:
For the other variables, the relationship does not appear significant.
import pandas as pd
from matplotlib import pyplot as plt
from datetime import timedelta
plt.style.use('seaborn')
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import r2_score
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE
from sklearn import datasets
%matplotlib inline
The purpose of this experiment is to understand the relationships between appliance energy consumption and different predictors. Using the data from the experiment, we also want to be able to predict the appliance energy consumption and to identify which predictors have a significant effect on it.
The energy consumption information is collected with an internet-connected energy monitoring system at 10-minute intervals, then stored and reported by e-mail every 12 hours.
#Import the energydata_complete.csv and use 'date' as the index of the rows
energy = pd.read_csv('data/energydata_complete.csv', index_col='date', parse_dates=True)
energy.index = energy.index.tz_localize('UTC') #Localise the index to UTC
#Extracting the year, month, date of the day, day, and hour for further data analysis later
energy['Year'] = energy.index.year
energy['Month'] = energy.index.month
energy['Day'] = energy.index.day
energy['Weekday Name'] = energy.index.day_name()
energy['Hour'] = energy.index.hour
energy['Weekday Code'] = energy.index.weekday #for linear regression
# Giving code for weekend = 1 and weekday = 0
energy.loc[energy['Weekday Code'] > 4, 'Weekend or Not'] = 1
energy.loc[energy['Weekday Code'] <= 4, 'Weekend or Not'] = 0
# Calculating NSM (number of seconds since midnight)
time = pd.to_datetime(energy.index)
energy['NSM'] = time.hour*3600 + time.minute*60 + time.second
energy.head()
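For reference, the weekend flag and NSM above can also be derived in a single vectorised step with np.where; a sketch of the logic on a toy 10-minute index (not the real data):

```python
import numpy as np
import pandas as pd

# Toy 10-minute index starting on a Monday (2016-01-11 was a Monday)
idx = pd.date_range('2016-01-11 00:00', periods=3, freq='10min')
df = pd.DataFrame(index=idx)

# Weekend flag: weekday codes 5 (Saturday) and 6 (Sunday) map to 1, the rest to 0
df['Weekend or Not'] = np.where(idx.weekday > 4, 1, 0)

# NSM = number of seconds since midnight
df['NSM'] = idx.hour * 3600 + idx.minute * 60 + idx.second
print(df)  # Monday rows get flag 0; NSM runs 0, 600, 1200
```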
The time span of the data set is 137 days (4.5 months). The diagrams below show the energy consumption for the whole period and for the first week. Both show that the energy consumption profile is highly variable.
#Line plot over the experiment period
fig, ax = plt.subplots(figsize=(15,6))
energy['Appliances'].plot(linewidth=0.5);
ax.set_ylabel('Appliances Wh')
ax.set_xlabel('Time')
# Select the first four weeks for further analysis
firstweek=energy.loc['2016-1-11 00:00:00':'2016-1-17 23:59:59']
secondweek=energy.loc['2016-1-18 00:00:00':'2016-1-24 23:59:59']
thirdweek=energy.loc['2016-1-25 00:00:00':'2016-1-31 23:59:59']
fourthweek=energy.loc['2016-2-1 00:00:00':'2016-2-7 23:59:59']
# Line plot of the appliances energy consumption for 1 week
fig, ax = plt.subplots(figsize=(15,6))
firstweek['Appliances'].plot(linewidth=0.5);
ax.set_ylabel('Appliances Wh')
ax.set_xlabel('Time (1 Week)')
From the histogram and box plot below, it can be seen that the appliance energy consumption data has a long tail. The median, the blue line inside the box, sits towards the left, with the data to its right more dispersed. There are also several outliers, shown as circles.
plt.figure(figsize=(20,6))
plt.hist(energy['Appliances'],bins=100)
plt.show()
plt.figure(figsize=(20,3))
sns.boxplot(x=energy['Appliances'])
As the histogram and box plot above show, the energy data distribution has a very long tail.
The pairplot diagrams below show the relationship between all the variables and the energy consumption of the appliances ('Appliances').
The scatter plots in the lower triangle show how each pair of variables interacts. The diagonal shows the distribution of each variable as a histogram. The upper triangle shows the Pearson correlation coefficient between the two variables: a coefficient of 1 is total positive correlation, -1 is total negative correlation, and 0 represents no correlation.
# Define the correlation coefficient function
def corrfunc(x, y, **kws):
    (r, p) = pearsonr(x, y)
    ax = plt.gca()
    ax.text(0.1, 0.1, "{0:.2f}".format(r),
            transform=ax.transAxes,
            color='black', fontsize=60)
# Define a function that draws a pairplot diagram for a list of variables
def pairplotdiagram(variables):
    g = sns.PairGrid(energy, vars=variables)
    g.map_lower(sns.regplot, color="0", line_kws={"color": "r", "alpha": 0.7, "lw": 2}, scatter_kws={'s': 5})
    g.map_upper(corrfunc)
    g.map_diag(sns.distplot, hist_kws={"color": "cyan"}, kde_kws={"color": "black", "lw": 1})
# Defining the variables in each pairplot diagram
list1=['Appliances', 'lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3']
list2=['Appliances', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6']
list3=['Appliances', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9']
list4=['Appliances', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'NSM','T6']
# Print the first pairplot diagram
pairplotdiagram(list1)
# Print the second pairplot diagram
pairplotdiagram(list2)
# Print the third pairplot diagram
pairplotdiagram(list3)
# Print the fourth pairplot diagram
pairplotdiagram(list4)
In the pairplot diagrams above, we can focus on the first row of each diagram to see the correlation coefficients between the appliance energy consumption (Appliances) and the other variables. The correlation with the 'lights' variable is the highest, with a coefficient of 0.20.
An hourly heat map was created for four consecutive weeks of appliance energy consumption to identify any time trends. As can be clearly seen, there is a strong time component in the energy consumption pattern: consumption starts to rise around 6 in the morning, and there is a significant escalation around noon. In terms of day of the week, no clear pattern can be identified.
# Summing Up the Energy Consumption of 10 minutes into an hour total, calculation done for each week
heatmap1=firstweek.groupby(['Day','Hour']).agg('sum')
heatmap2=secondweek.groupby(['Day','Hour']).agg('sum')
heatmap3=thirdweek.groupby(['Day','Hour']).agg('sum')
heatmap4=fourthweek.groupby(['Day','Hour']).agg('sum')
# Define a function to produce the heatmap
def heatmapfigure(weekdata):
    # Reset the index
    flat = weekdata.reset_index()
    # Pivot by Hour and Day, with Appliances as the value
    pivot = flat.pivot(index="Hour", columns="Day", values="Appliances")
    # Draw the heatmap
    f, ax = plt.subplots(figsize=(2.5, 6))
    g = sns.heatmap(pivot, cmap="YlOrRd")
    g.set_facecolor('#ffffcc')
    days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']  # Day labels for the heatmap
    ax.set_xticklabels(days)
    plt.xticks(rotation=90)  # Rotate the labels so they can be read clearly
heatmapfigure(heatmap1)
heatmapfigure(heatmap2)
heatmapfigure(heatmap3)
heatmapfigure(heatmap4)
The heat maps above show a clear trend along the time axis: energy use tends to increase around 6-8 am and drops off again around 10-11 pm. Across days of the week, no clear trend is visible in the four weeks illustrated.
We use linear regression as the model to predict the energy consumption. The data is separated into train and test sets so that the model can be evaluated later.
# Separating data randomly, 25% test, 75% train
train, test = train_test_split(energy, test_size=0.25)
# Check the number of the data for the test and train
print(test.shape)
print(train.shape)
We define X as all the independent variables, and y as the dependent variable (the outcome of the independent variables).
#Defining X_train, y_train, X_test, y_test
X_train = train[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'Weekday Code', 'Weekend or Not', 'NSM']]
y_train = train[['Appliances']]
X_test = test[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'Weekday Code', 'Weekend or Not', 'NSM']]
y_test = test[['Appliances']]
The linear regression equation from the train data:
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print("y = x *", reg.coef_, "+", reg.intercept_)
We compute the RMSE, R squared, MAE, MAPE, and MSE to evaluate our linear model.
# Predicting the outcome of the model from the test data
y_pred = reg.predict(X_test)
# Define a helper for the Mean Absolute Percentage Error (MAPE)
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print() #Space
print('R Squared:', r2_score(y_test, y_pred))
print() #Space
print('Mean Absolute Error (MAE):', metrics.mean_absolute_error(y_test, y_pred))
print() #Space
print('Mean Absolute Percentage Error (MAPE):', mean_absolute_percentage_error(y_test, y_pred),'%')
print() #Space
print('Mean Squared Error (MSE):', metrics.mean_squared_error(y_test, y_pred))
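A quick hand-checkable sanity test of the MAPE helper defined above (hypothetical numbers, not from the energy data):

```python
import numpy as np

# Same helper as defined above
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Errors of 10% and 20% on two points should average to 15%
mape = mean_absolute_percentage_error([100, 100], [110, 80])
print(mape)
```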
Analysis:
Our RMSE, R squared, MAE, and MAPE are about the same as those in the journal paper (the slight differences arise because we separate our train and test data differently).
# Predicted values plotted against the actual test values, with a regression line
ax1 = sns.regplot(x=y_test['Appliances'], y=y_pred.flatten())
# Use linear regression as the model
lr = LinearRegression()
# Rank all features
rfe = RFE(lr, n_features_to_select=1) # Using 1 so the ranking will be from 1 to 28
rfe.fit(energy[['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'Weekday Code', 'Weekend or Not', 'NSM']],energy[['Appliances']])
print(rfe.ranking_)
# Making a new dataframe to see it clearly
df = pd.DataFrame(rfe.ranking_, columns=['Ranking'],)
# Add new column: Features
df['Features'] = ['lights', 'T1', 'RH_1', 'T2', 'RH_2', 'T3', 'RH_3', 'T4', 'RH_4', 'T5', 'RH_5', 'T6', 'RH_6', 'T7', 'RH_7', 'T8', 'RH_8', 'T9', 'RH_9', 'T_out', 'Press_mm_hg', 'RH_out', 'Windspeed', 'Visibility', 'Tdewpoint', 'Weekday Code', 'Weekend or Not', 'NSM']
# Sort the features based on ranking
df.sort_values(by=['Ranking']).reset_index(drop=True)
It can be seen that the RFE ranking from our data differs from the one in the journal paper.
K-means clustering is one of the simplest and popular unsupervised learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors without referring to known, or labelled, outcomes. This notebook illustrates the process of K-means clustering by generating some random clusters of data and then showing the iterations of the algorithm as random cluster means are updated.
We first generate random data around 4 centers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors
from matplotlib import pyplot as plt
from sklearn.metrics import pairwise_distances_argmin
from matplotlib.colors import ListedColormap, LinearSegmentedColormap
%matplotlib inline
center_1 = np.array([1,2])
center_2 = np.array([6,6])
center_3 = np.array([9,1])
center_4 = np.array([-5,-1])
# Generate random data and center it to the four centers each with a different variance
np.random.seed(5)
data_1 = np.random.randn(200,2) * 1.5 + center_1
data_2 = np.random.randn(200,2) * 1 + center_2
data_3 = np.random.randn(200,2) * 0.5 + center_3
data_4 = np.random.randn(200,2) * 0.8 + center_4
data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)
plt.scatter(data[:,0], data[:,1], s=7, c='k')
plt.show()
You need to generate four random centres.
This part of the portfolio should contain at least:
- Four random centres generated as centres = np.random.randn(k,c)*std + mean, where k is set to 4, std and mean are the standard deviation and mean of the data, and c is the number of features in the data. Set the random seed to 6.
- The centres plotted in green, blue, yellow, and cyan, with the edgecolors set to red.
- centre_random = np.array([2,2]) as the point around which the random centres are generated.
# Generate four random centres around centre_random
np.random.seed(6)
centre_random = np.array([2,2])
randomcentre1 = np.random.randn(1,2) * 4 + centre_random
randomcentre2 = np.random.randn(1,2) * 4 + centre_random
randomcentre3 = np.random.randn(1,2) * 4 + centre_random
randomcentre4 = np.random.randn(1,2) * 4 + centre_random
centres = np.concatenate((randomcentre1, randomcentre2, randomcentre3, randomcentre4), axis = 0)
# Show new generated random centres in Cyan, Blue, Green, and Yellow
fig = plt.figure(figsize=(10,8))
plt.scatter(data[:,0], data[:,1], s=7, c='black')
plt.scatter(centres[:,0], centres[:,1], s=120, c=['cyan','blue','green', 'yellow'], edgecolors='red', linewidth=2, alpha=0.5)
plt.show()
You need to implement the process of k-means clustering. Implement each iteration as a separate cell, assigning each data point to the closest centre, then updating the cluster centres based on the data, then plotting the new clusters.
K-means is a method to cluster a set of data into k clusters, where each data point is allocated to the nearest cluster centre. In this notebook the K-means clustering process is demonstrated: the data randomly generated in section 1 will be clustered into 4 clusters (k=4).
The mechanism of K-means:
1. Choose k initial centres (here, generated randomly).
2. Assign each data point to its nearest centre.
3. Recompute each centre as the mean of the points assigned to it.
4. Repeat steps 2 and 3 until the centres stop moving.
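A compact demonstration of one assignment/update iteration of K-means, on six toy points and two deliberately poor starting centres (k=2 here just for illustration):

```python
import numpy as np
from sklearn.metrics import pairwise_distances_argmin

# Two tight toy clusters and two starting centres placed between them
data = np.array([[0., 0.], [0., 1.], [1., 0.],
                 [10., 10.], [10., 11.], [11., 10.]])
centres = np.array([[5., 5.], [6., 6.]])

# Assignment: each point gets the index of its nearest centre
labels = pairwise_distances_argmin(data, centres)

# Update: move each centre to the mean of its assigned points
centres = np.array([data[labels == i].mean(axis=0) for i in range(2)])
print(centres)  # each centre lands on its cluster's mean
```

On this toy data a single iteration already moves the centres onto the true cluster means; the full loop below repeats the two steps until the centres stop moving.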
# Our data and centres
data = np.concatenate((data_1, data_2, data_3, data_4), axis = 0)
centres = np.concatenate((randomcentre1, randomcentre2, randomcentre3, randomcentre4), axis = 0)
# Number of Cluster
k=4
#For Iteration Numbering
m=0
# Loop to run the iterations
while True:
    # Assign each data point in 'data' to the nearest of 'centres' to make a temporary cluster
    cluster = pairwise_distances_argmin(data, centres)
    # Compute new centres as the mean of each temporary cluster
    new_centres = np.array([data[cluster == i].mean(axis=0) for i in range(k)])
    # Visualisation
    fig = plt.figure(figsize=(5, 3))  # Size of figure
    # Iteration numbering
    n = m + 1
    print('iteration #', n)
    # Colours set for each cluster
    colors = ['cyan', 'green', 'blue', 'yellow']
    # Each point coloured based on its cluster
    plt.scatter(data[:, 0], data[:, 1], s=7, c=cluster, cmap=matplotlib.colors.ListedColormap(colors))
    # Old centres coloured in black
    plt.scatter(centres[:, 0], centres[:, 1], s=40, c='black', edgecolors='black', linewidth=2)
    # New centres coloured in their cluster colour with a red edge
    plt.scatter(new_centres[:, 0], new_centres[:, 1], s=120, c=colors, edgecolors='red', linewidth=2, alpha=0.5)
    plt.show()
    # When the new centres match the previous centres, the clustering is final
    if np.all(centres == new_centres):
        break
    centres = new_centres
    m = n
# Final Cluster
fig = plt.figure(figsize=(10,8))
plt.title("FINAL CLUSTERING")
#Each point colored based on its cluster
plt.scatter(data[:,0], data[:,1], s=7, c=cluster, cmap=matplotlib.colors.ListedColormap(colors))
# Centres of each cluster shown in red circle
plt.scatter(new_centres[:,0], new_centres[:,1], s=120, c=['cyan','green','blue', 'yellow'], edgecolors='red', linewidth=4, alpha=0.5)
plt.show()